class: center, middle, inverse, title-slide # Data Management, Recoding, and the tidyverse ## Tutorial for Applied Statistics with R ### Mariana Patino
m.patino@zeppelin-university.net
### 2019-02-20 --- <!--self_contained: true https://github.com/yihui/xaringan/issues/64 https://github.com/yihui/xaringan/issues/10 https://github.com/gnab/remark/issues/194 --> <!--https://stackoverflow.com/questions/48976338/how-can-i-modifiy-the-positions-of-the-text-and-logo-on-rmarkdown-title-slide/49018455#49018455!--> # Outline 1. Using `dplyr` to manipulate data --- class: inverse, center, middle # Data Manipulation with dplyr <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- class: inverse, center, middle <img src="pictures/dplyr_logo.png" width="50%" class="centerimg"> --- # dplyr Overview `dplyr` is a package for consistent data manipulation of (tidy) data - `filter()`: filter observations/rows - `arrange()`: arrange (sort) by a variable/column - `select()`: select a variable/column - `mutate()`: change or create values of a variable - `summarise()`: summarise the dataset into a single observation - complemented by `group_by()` Further information: http://dplyr.tidyverse.org/ Load `dplyr` with the tidyverse ```r library(tidyverse) ``` ``` ## Warning: package 'ggplot2' was built under R version 3.5.2 ``` ``` ## Warning: package 'tibble' was built under R version 3.5.2 ``` ``` ## Warning: package 'tidyr' was built under R version 3.5.2 ``` ``` ## Warning: package 'dplyr' was built under R version 3.5.2 ``` --- # Recap Tidy Data <img src="pictures/tidy_data.png" width="100%" class="centerimg"> <small><small>Source: http://r4ds.had.co.nz/tidy-data.html</small></small> --- # The Dataset: 336k Flights from NYC in '13 .smaller[ ```r # install.packages("nycflights13") library(nycflights13) ``` ``` ## Warning: package 'nycflights13' was built under R version 3.5.2 ``` ```r flights %>% glimpse() ``` ``` ## Observations: 336,776 ## Variables: 19 ## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,... ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55... ## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60... ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2... ## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7... ## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7... ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -... ## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",... ## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79... ## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN... ## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"... ## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"... ## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138... ## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94... ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,... ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013... ``` ] --- # Filter Observations ``` ## Warning: `data_frame()` is deprecated, use `tibble()`. ## This warning is displayed once per session. ``` `filter()` the observations (rows) of a dataset for certain values. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> x </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> y </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> x </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> x </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> y </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% filter(var == "x") ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> x </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> x </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> x </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] --- # Filter Example <mark>Query</mark>: Find the long-distance flights (> 2500 miles) that departed in spring 2013. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r flights %>% filter(month <= 03 & distance > 2500) ``` ``` ## # A tibble: 2,916 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 558 600 -2 923 ## 2 2013 1 1 611 600 11 945 ## 3 2013 1 1 655 700 -5 1037 ## 4 2013 1 1 729 730 -1 1049 ## 5 2013 1 1 734 737 -3 1047 ## 6 2013 1 1 745 745 0 1135 ## 7 2013 1 1 746 746 0 1119 ## 8 2013 1 1 803 800 3 1132 ## 9 2013 1 1 826 817 9 1145 ## 10 2013 1 1 857 900 -3 1516 ## # ... with 2,906 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` ] --- # Arrange Observations `arrange()` the observations (rows) of a dataset for certain values. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var2 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #B6ACD2;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #B6ACD2;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% arrange(var1, desc(var2)) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var2 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #B6ACD2;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #B6ACD2;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] --- # Arrange Example <mark>Query</mark>: Sort the flights by day of the month (ascending) and departure-delay (descending). <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r flights %>% arrange(day, desc(dep_delay)) ``` ``` ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 848 1835 853 1001 ## 2 2013 12 1 657 1930 687 1010 ## 3 2013 5 1 9 1655 434 308 ## 4 2013 1 1 2343 1724 379 314 ## 5 2013 8 1 2311 1659 372 117 ## 6 2013 3 1 1528 920 368 1738 ## 7 2013 7 1 1602 959 363 1739 ## 8 2013 3 1 1449 855 354 1701 ## 9 2013 7 1 2118 1525 353 2309 ## 10 2013 7 1 1410 820 350 1558 ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` ] --- # Select Variables `select()` certain variables/columns of a dataset and/or rename the variable<sup>1</sup>. .footnote[ .smaller[ [1] If you don't want to drop unused variables, use `rename()`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> var2 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var3 </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% select(var1, foo = var3) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> foo </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] --- # Select Example <mark>Query</mark>: Select the carrier and the tail-number of the flights. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r flights %>% select(carrier, tail_number = tailnum) ``` ``` ## # A tibble: 336,776 x 2 ## carrier tail_number ## <chr> <chr> ## 1 UA N14228 ## 2 UA N24211 ## 3 AA N619AA ## 4 B6 N804JB ## 5 DL N668DN ## 6 UA N39463 ## 7 B6 N516JB ## 8 EV N829AS ## 9 B6 N593JB ## 10 AA N3ALAA ## # ... with 336,766 more rows ``` ] --- # Mutate Variables `mutate()` (change or create new) variables using window functions<sup>1</sup><sup>2</sup>. .footnote[ .smaller[ [1] A window function means that the number of observations remains unchanged, the opposite happens in summarise functions, where the number of observations is reduced. [2] If you want to drop unused variables, use `transmute()`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F6A788;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F6A788;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% mutate(var = var * 2, foo = 1:n()) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> foo </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #BF7565;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #7F0000;width: 15px;text-align:center;"> 6 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #CFD3E7;width: 15px;text-align:center;"> 2 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #BEB9D9;width: 15px;text-align:center;"> 3 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #AD9FCC;width: 15px;text-align:center;"> 4 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #BF7565;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9C85BE;width: 15px;text-align:center;"> 5 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 6 </td> </tr> </tbody> </table> ] --- # Mutate Example <mark>Query</mark>: Create a unique ID for each flight and compute the flight-distance in km (distance is reported in miles). <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r flights %>% mutate(id = 1:n(), dist_km = distance * 1.60934) %>% select(id, distance, dist_km) ``` ``` ## # A tibble: 336,776 x 3 ## id distance dist_km ## <int> <dbl> <dbl> ## 1 1 1400 2253. ## 2 2 1416 2279. ## 3 3 1089 1753. ## 4 4 1576 2536. ## 5 5 762 1226. ## 6 6 719 1157. ## 7 7 1065 1714. ## 8 8 229 369. ## 9 9 944 1519. ## 10 10 733 1180. ## # ... with 336,766 more rows ``` ] --- # Summarise Variables `summarise()` a dataset (compute a given summary of variables)<sup>1</sup>. .footnote[ .smaller[ [1] If the summarise function is applied to multipe variables, you can also use `summarise_if()`, `summarise_at()`, ... ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> var2 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F6A788;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #CFD3E7;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #BEB9D9;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #AD9FCC;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9C85BE;width: 15px;text-align:center;"> 5 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F6A788;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 6 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% summarise( mu1 = mean(var1), min2 = min(var2), max2 = max(var2) ) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> mu1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> min2 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> max2 </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F6A788;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 6 </td> </tr> </tbody> </table> ] --- # Summarise Example <mark>Query</mark>: Find the minimum, average, and maximum arrival-delay for all flights. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ .left-col[ ```r flights %>% summarise( min_delay = min(arr_delay), avg_delay = mean(arr_delay), max_delay = max(arr_delay) ) ``` ``` ## # A tibble: 1 x 3 ## min_delay avg_delay max_delay ## <dbl> <dbl> <dbl> ## 1 NA NA NA ``` ] ] -- .semi-small[ .right-col[ ```r flights %>% * filter(!is.na(arr_delay)) %>% summarise( min_delay = min(arr_delay), avg_delay = mean(arr_delay), max_delay = max(arr_delay) ) ``` ``` ## # A tibble: 1 x 3 ## min_delay avg_delay max_delay ## <dbl> <dbl> <dbl> ## 1 -86 6.90 1272 ``` ] ] --- # Grouped Mutate Variables `group_by() %>% mutate()` (change or create new) variables using window functions **per group**<sup>1</sup>. .footnote[ .smaller[ [1] The output is still grouped, to ungroup use `df %>% ... %>% ungroup()`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> grp </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> v </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% group_by(grp) %>% mutate( v = mean(v), foo = 1:n() ) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> grp </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> v </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> foo </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #B6ACD2;width: 15px;text-align:center;"> 2 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 3 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #B6ACD2;width: 15px;text-align:center;"> 2 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 3 </td> </tr> </tbody> </table> ] --- # Grouped Mutate Example <mark>Query</mark>: For each flight, find the difference of the arrival-delay to the average arrival-delay of the respective airline (`carrier`). <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r flights %>% filter(!is.na(arr_delay)) %>% group_by(carrier) %>% mutate(delta_arr_delay = arr_delay - mean(arr_delay)) %>% select(carrier, delta_arr_delay) ``` ``` ## # A tibble: 327,346 x 2 ## # Groups: carrier [16] ## carrier delta_arr_delay ## <chr> <dbl> ## 1 UA 7.44 ## 2 UA 16.4 ## 3 AA 32.6 ## 4 B6 -27.5 ## 5 DL -26.6 ## 6 UA 8.44 ## 7 B6 9.54 ## 8 EV -29.8 ## 9 B6 -17.5 ## 10 AA 7.64 ## # ... with 327,336 more rows ``` ] --- # Grouped Summarise Variables `group_by() %>% summarise()` summarise variables **per group**<sup>1</sup>. .footnote[ .smaller[ [1] The output is still grouped, to ungroup use `df %>% ... %>% ungroup()`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> grp </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> var1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> var2 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E0ECF4;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #EF6548;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #CFD3E7;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FEE8C8;width: 15px;text-align:center;"> 1 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #BEB9D9;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #AD9FCC;width: 15px;text-align:center;"> 4 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9C85BE;width: 15px;text-align:center;"> 5 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 6 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r df %>% group_by(grp) %>% summarise( mu1 = mean(var1), max2 = max(var2) ) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> grp </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> mu1 </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> max2 </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: 15px;text-align:center;"> A </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F9BC9D;width: 15px;text-align:center;"> 2 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #AD9FCC;width: 15px;text-align:center;"> 4 </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #377EB8;width: 15px;text-align:center;"> B </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #F49072;width: 15px;text-align:center;"> 3 </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8C6BB1;width: 15px;text-align:center;"> 6 </td> </tr> </tbody> </table> ] --- # Grouped Summarise Example <mark>Query</mark>: For each airline (carrier), find the mean and meadian arrival over all flights of 2013. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ .left-col[ ```r flights %>% filter(!is.na(arr_delay)) %>% group_by(carrier) %>% summarise( mean_delay = mean(arr_delay), median_delay = median(arr_delay) ) ``` ] .right-col[ ``` ## # A tibble: 16 x 3 ## carrier mean_delay median_delay ## <chr> <dbl> <dbl> ## 1 9E 7.38 -7 ## 2 AA 0.364 -9 ## 3 AS -9.93 -17 ## 4 B6 9.46 -3 ## 5 DL 1.64 -8 ## 6 EV 15.8 -1 ## 7 F9 21.9 6 ## 8 FL 20.1 5 ## 9 HA -6.92 -13 ## 10 MQ 10.8 -1 ## 11 OO 11.9 -7 ## 12 UA 3.56 -6 ## 13 US 2.13 -6 ## 14 VX 1.76 -9 ## 15 WN 9.65 -3 ## 16 YV 15.6 -2 ``` ] ] --- # Full Query Example <mark>Query</mark>: Find the 5 aircrafts (by tail number) that regained the most lost time (on average) and have at least 20 flights<sup>1</sup>. .footnote[ .smaller[ [1] Time regained can be calculated as the `regain = dep_delay - arr_delay`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- Part 1: Calculate the `regain`. .semi-small[ .left-col[ ```r flights %>% filter( !is.na(arr_delay), !is.na(dep_delay) ) %>% mutate( regain = dep_delay - arr_delay ) %>% select(tailnum, regain) ``` ] .right-col[ ``` ## # A tibble: 327,346 x 2 ## tailnum regain ## <chr> <dbl> ## 1 N14228 -9 ## 2 N24211 -16 ## 3 N619AA -31 ## 4 N804JB 17 ## 5 N668DN 19 ## 6 N39463 -16 ## 7 N516JB -24 ## 8 N829AS 11 ## 9 N593JB 5 ## 10 N3ALAA -10 ## # ... with 327,336 more rows ``` ] ] --- # Full Query Example <mark>Query</mark>: Find the 5 aircrafts (by tail number) that regained the most lost time (on average) and have at least 20 flights<sup>1</sup>. .footnote[ .smaller[ [1] Time regained can be calculated as the `regain = dep_delay - arr_delay`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> Part 2: Calculate the average `regain` per `tailnum`. .semi-small[ .left-col[ ```r flights %>% filter( !is.na(arr_delay), !is.na(dep_delay) ) %>% mutate( regain = dep_delay - arr_delay ) %>% group_by(tailnum) %>% summarise( n_flights = n(), regain = mean(regain) ) ``` ] .right-col[ ``` ## # A tibble: 4,037 x 3 ## tailnum n_flights regain ## <chr> <int> <dbl> ## 1 D942DN 4 0 ## 2 N0EGMQ 352 -1.47 ## 3 N10156 145 5.23 ## 4 N102UW 48 5.06 ## 5 N103US 46 3.74 ## 6 N104UW 46 8.33 ## 7 N10575 269 1.45 ## 8 N105UW 45 2.84 ## 9 N107US 41 5.27 ## 10 N108UW 60 5.47 ## # ... with 4,027 more rows ``` ] ] --- # Full Query Example <mark>Query</mark>: Find the 5 aircrafts (by tail number) that regained the most lost time (on average) and have at least 20 flights<sup>1</sup>. .footnote[ .smaller[ [1] Time regained can be calculated as the `regain = dep_delay - arr_delay`. ] ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> Part 3: Filter for the number of flights and take only the top five . .smaller[ .left-col[ ```r flights %>% filter( !is.na(arr_delay), !is.na(dep_delay) ) %>% mutate( regain = dep_delay - arr_delay ) %>% group_by(tailnum) %>% summarise( n_flights = n(), regain = mean(regain) ) %>% filter(n_flights > 20) %>% arrange(desc(regain)) %>% slice(1:5) ``` ] .right-col[ ``` ## # A tibble: 5 x 3 ## tailnum n_flights regain ## <chr> <int> <dbl> ## 1 N423AS 29 26.9 ## 2 N382HA 26 23.8 ## 3 N419AS 32 20.6 ## 4 N540AA 34 17.3 ## 5 N847VA 91 17.1 ``` ] ] --- # dplyr Summary We use `dplyr`/`tidyverse` to manipulate data with the following functions: - `filter()`: filter observations/rows - `arrange()`: arrange (sort) by a variable/column - `select()`: select a variable/column - `mutate()`: change or create values of a variable - `group_by() %>% mutate()`: change or create values of a variable per group - `summarise()`: summarise the dataset into a single observation - `group_by() %>% summarise()`: summarise the dataset per group More information: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html --- # Additional dplyr functions Useful helpers - `transmute()` like `mutate()` but drops unused variables - `rename()` like `select()`* but keeps unused variables - Helper functions for `select()`: `contains()`, `starts_with()`, `ends_with()`, `one_of()`, ... E.g., `df %>% select(starts_with("arr"))` - `mutate_at()` and `summarise_at()` mutates/summarises multiple variables - `mutate_if()` and `summarise_if()` mutates/summarises if a variable fulfills certain conditions CheatSheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf --- class: inverse, center, middle <img src="pictures/fruits.png" width="50%" class="centerimg"> --- # Bind Rows `bind_rows()` bind together two or more datasets by row. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .left-three[ <h3>Command</h3> <br> ```r bind_rows(df1, df2) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] --- # Bind Rows Example <mark>Task</mark>: Add two datasets together by rows. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r df1 <- data_frame(id = 1:2, name = c("Alice", "Bob")) df2 <- data_frame(id = 3:4, name = c("Charlie", "Dave")) bind_rows(df1, df2) ``` ``` ## # A tibble: 4 x 2 ## id name ## <int> <chr> ## 1 1 Alice ## 2 2 Bob ## 3 3 Charlie ## 4 4 Dave ``` ] --- # Bind Cols `bind_cols()` bind together two or more datasets by columns. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> .left-col[ <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .right-col[ <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] ] .left-three[ <h3>Command</h3> <br> ```r bind_cols(df1, df2) ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: 15px;text-align:center;"> . </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] --- # Bind Cols Example <mark>Task</mark>: Add two datasets together by columns. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r df1 <- data_frame(id = 1:2, name = c("Alice", "Bob")) df2 <- data_frame(sales = c(100, 95), region = c("North", "South")) bind_cols(df1, df2) ``` ``` ## # A tibble: 2 x 4 ## id name sales region ## <int> <chr> <dbl> <chr> ## 1 1 Alice 100 North ## 2 2 Bob 95 South ``` ] --- # Joins Join the superheroes and the address of the publishers together. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> .left-col[ <center><strong>Superheroes</strong></center> .smaller[ <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> superhero </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> alignment </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> publisher </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Batman </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> good </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> DC </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Joker </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> bad </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> DC </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> Prof X </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> good </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> Marvel </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FF7F00;width: ;text-align:center;"> Hellboy </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FF7F00;width: ;text-align:center;"> good </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FF7F00;width: ;text-align:center;"> Dark Horse </td> </tr> </tbody> </table> ] <br> <center><strong>Publishers</strong></center> .smaller[ <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> publisher </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> address </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> DC </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Burbank (CA) </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> Marvel </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> NYC (NY) </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FFFF33;width: ;text-align:center;"> Image Comics </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FFFF33;width: ;text-align:center;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] -- .right-col[ <center><strong>Joined Dataset</strong></center> .smaller[ <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> superhero </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> alignment </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> publisher </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #74ADD1;width: ;text-align:center;"> address </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Batman </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> good </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> DC </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Burbank (CA) </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Joker </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> bad </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> DC </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: ;text-align:center;"> Burbank (CA) </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> Prof X </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> good </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> Marvel </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: ;text-align:center;"> NYC (NY) </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: ;text-align:center;"> ??? </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: ;text-align:center;"> ??? </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: ;text-align:center;"> ??? </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #E41A1C;width: ;text-align:center;"> ??? </td> </tr> </tbody> </table> ] ] --- # Left Join `left_join()` to join two datasets together by a variable, taking all values from the "left" dataset (the one supplied first). <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .left-three[ <h3><center>Input</center></h3> <br> .left-col[ <center>"left" df</center> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> id </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> x </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] .right-col[ <center>"right" df</center> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> id </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> y </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FFFF33;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] ] .left-three[ <h3>Command</h3> <br> ```r left_join(df1, df2, by = "id") ``` ] -- .right-three[ <h3><center>Output</center></h3> <br> <table> <thead> <tr> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #FC9272;width: 15px;text-align:center;"> id </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #9E9AC8;width: 15px;text-align:center;"> x </th> <th style=" font-weight: bold; color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #DEEBF7;width: 15px;text-align:center;"> y </th> </tr> </thead> <tbody> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #4DAF4A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #252525;width: 15px;text-align:center;"> . </td> </tr> <tr> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #984EA3;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #A6D96A;width: 15px;text-align:center;"> . </td> <td style=" color: #000000;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #252525;width: 15px;text-align:center;"> . </td> </tr> </tbody> </table> ] --- # Left Join Example <mark>Join</mark> the address of the publishers to the superheroes (preserving all data in `superheroes`). .smaller[ ```r superheroes <- data_frame( superhero = c("Batman", "Joker", "Prof X", "Hellboy"), alignment = c("good", "bad", "good", "good"), publisher = c("DC", "DC", "Marvel", "Dark Horse") ) publishers <- data_frame( publisher = c("DC", "Marvel", "Image Comics"), address = c("Burbank (CA)", "NYC (NY)", "Portland (OR)") ) ``` ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=100%></html> -- .semi-small[ ```r left_join(superheroes, publishers, by = "publisher") ``` ``` ## # A tibble: 4 x 4 ## superhero alignment publisher address ## <chr> <chr> <chr> <chr> ## 1 Batman good DC Burbank (CA) ## 2 Joker bad DC Burbank (CA) ## 3 Prof X good Marvel NYC (NY) ## 4 Hellboy good Dark Horse <NA> ``` ] --- # All Joins and Set Operations .semi-small[ .left-col[ [R4DS Mutating Joins](http://r4ds.had.co.nz/relational-data.html#mutating-joins) > A mutating join [...] first matches observations by their keys, then copies across variables from one table to the other. - `left_join()` - `right_join()` - `inner_join()` - `full_join()` [R4DS Filtering Joins](http://r4ds.had.co.nz/relational-data.html#filtering-joins) > Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. - `semi_join()` - `anti_join()` ] .right-col[ [R4DS Set Operations](http://r4ds.had.co.nz/relational-data.html#set-operations) > All these operations work with a complete row, comparing the values of every variable. - `intersect()` - `union()` - `setdiff()` See also: https://github.com/gadenbuie/tidy-animated-verbs ] ] --- # Mutating Joins Animated Join two datasets together, taking all variables (columns) and selected observations (rows) based on the join function. .left-col[ `left_join()` <img src="pictures/animate_left_join.gif" width="70%" class="centerimg"> ] .right-col[ `right_join()` <img src="pictures/animate_right_join.gif" width="70%" class="centerimg"> ] https://github.com/gadenbuie/tidy-animated-verbs for animated version --- # Mutating Joins Animated .left-col[ `inner_join()` <img src="pictures/animate_inner_join.gif" width="70%" class="centerimg"> ] .right-col[ `full_join()` <img src="pictures/animate_full_join.gif" width="70%" class="centerimg"> ] https://github.com/gadenbuie/tidy-animated-verbs for animated version --- # Filtering Joins Animated .left-col[ `semi_join()` <img src="pictures/animate_semi_join.gif" width="70%" class="centerimg"> ] .right-col[ `anti_join()` <img src="pictures/animate_anti_join.gif" width="70%" class="centerimg"> ] https://github.com/gadenbuie/tidy-animated-verbs for animated version --- # Set Operations Animated .left-col[ `intersect()` <img src="pictures/animate_intersect.gif" width="70%" class="centerimg"> ] .right-col[ `setdiff()` <img src="pictures/animate_setdiff.gif" width="70%" class="centerimg"> ] https://github.com/gadenbuie/tidy-animated-verbs for animated version --- # Set Operations Animated .left-col[ `union()` <img src="pictures/animate_union.gif" width="70%" class="centerimg"> ] .right-col[ `union_all()` <img src="pictures/animate_union_all.gif" width="70%" class="centerimg"> ] https://github.com/gadenbuie/tidy-animated-verbs for animated version --- # More Data .left-col[ <img src="pictures/more_data.png" width="75%" class="centerimg"> ```r help(package = "nycflights13") ``` ] -- .right-col[ `nycflights13` hosts 5 different datasets that relate to each other. <img src="pictures/relational-nycflights.png" width="100%" class="centerimg"> <small><small>Source: http://r4ds.had.co.nz/relational-data.html</small></small> ] --- # Summary `tidyverse` makes our life easier among others with - `readr` for data input/output - `read_csv` and `write_csv` - `tidyr` for data cleaning and tidy data - `gather` and `spread` - `dplyr` for data manipulation - `filter`, `arrange`, `select`, `mutate`, and `summarise`, plain and `group_by` --- # More Information dplyr: - http://r4ds.had.co.nz/ - https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf - https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html